Recursive Restartability: Turning the Reboot Sledgehammer into a Scalpel
Abstract
Even after decades of software engineering research, complex computer systems still fail, primarily due to nondeterministic bugs that are typically resolved by rebooting. Conceding that Heisenbugs will remain a fact of life, we propose a systematic investigation of restarts as “high availability medicine.” In this paper we show how recursive restartability (RR) — the ability of a system to gracefully tolerate restarts at multiple levels — improves fault tolerance, reduces time-to-repair, and enables system designers to build flexible, highly available software infrastructures. Using several examples of widely deployed software systems, we identify properties that are required of RR systems and outline an agenda for turning the recursive restartability philosophy into a practical software structuring tool. Finally, we describe infrastructural support for RR systems, along with initial ideas on how to analyze and benchmark such systems.
Similar resources
Designing for High Availability and Measurability
We propose a structuring model, called recursive restartability, aimed at controlling the amount of end-to-end unavailability and improving the measurability of software infrastructures with high availability requirements. Recursive restartability exploits the benefits of restarts at various levels within complex software systems and relies on an execution infrastructure to monitor, cure, and re...
Reducing Recovery Time in a Small Recursively Restartable System
We present ideas on how to structure software systems for high availability by considering MTTR/MTTF characteristics of components in addition to the traditional criteria, such as functionality or state sharing. Recursive restartability (RR), a recently proposed technique for achieving high availability, exploits partial restarts at various levels within complex software infrastructures to reco...
Fault tolerant system with imperfect coverage, reboot and server vacation
This study is concerned with the performance modeling of a fault-tolerant system consisting of operating units supported by a combination of warm and cold spares. The on-line units as well as the warm standby units are subject to failure and are sent for repair to a repair facility with a single repairman, who is also prone to failure. If the failed unit is not detected, the system enters into an unsafe...
Comparing the complication of electrocautery and scalpel methods in surgical incisions of hysterectomy surgery
Aims: Surgical incisions have been performed by a scalpel for many years, and today the use of an alternative method, catheter incision, is increasing daily. This study aimed to evaluate the complications of using electrocautery and scalpel in surgical incisions of the abdominal wall in hysterectomy surgery. Materials and Methods: The present study is a single-blind study performed on 92 elig...
Publication date: 2001